{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Thread\n",
    "A basic element of the data to be processed on the GPU.\n",
    "\n",
    "### CUDA Blocks\n",
    "A collection or group  of threads which can communicate within their own block.\n",
    "### Grid\n",
    "CUDA blocks are grouped into a grid. Blocks are independent of each other.\n",
    "\n",
    "### Kernel\n",
    "A kernel is executed as a grid of blocks of threads.\n",
    "\n",
    "<img src=\"../images/grid.png\" width=\"50%\" height=\"50%\">\n",
    "\n",
    "### Streaming Multiprocessor (SM) \n",
    "Streaming multi-processors with multiple processing cores. Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in GPU. One SM can run several concurrent CUDA blocks depending on the resources needed by CUDA blocks. Each kernel is executed on one device and CUDA supports running multiple kernels on a device at one time. The below figure shows the kernel execution and mapping on hardware resources available in GPU.\n",
    "\n",
    "<img src=\"../images/mapping.png\" width=\"50%\" height=\"50%\">\n",
    "\n",
    "### Warp\n",
    "32 threads form a warp.The SM has a maximum number of warps that can be active at once. \n",
    "\n",
    "### Memory Hierarchy\n",
    "CUDA-capable GPUs have a memory hierarchy as shown below:\n",
    "\n",
    "<img src=\"../images/memory.png\" width=\"50%\" height=\"50%\">\n",
    "\n",
    "The following memories are exposed by the GPU architecture:\n",
    "\n",
    "- **Registers** : These are private to each thread, which means that registers assigned to a thread are not visible to other threads. The compiler makes decisions about register utilization.\n",
    "- **L1/Shared memory (SMEM)** : Every SM has a fast, on-chip scratchpad memory that can be used as L1 cache and shared memory. All threads in a CUDA block can share shared memory, and all CUDA blocks running on a given SM can share the physical memory resource provided by the SM..\n",
    "- **Read-only memory** : Each SM has an instruction cache, constant memory,  texture memory and RO cache, which is read-only to kernel code.\n",
    "- **L2 cache** : The L2 cache is shared across all SMs, so every thread in every CUDA block can access this memory. The NVIDIA A100 GPU has increased the L2 cache size to 40 MB as compared to 6 MB in V100 GPUs.\n",
    "- **Global memory** : This is the framebuffer size of the GPU and DRAM sitting in the GPU.\n",
    "\n",
    "To learn more, please checkout the CUDA Refresher series at https://developer.nvidia.com/blog/tag/cuda-refresher/ .\n",
    "\n",
    "\n",
    "### Occupancy\n",
    "The Streaming Multiprocessor (SM) has a maximum number of active warps  at once. Occupancy is the ratio of active warps to maximum supported active warps. Occupancy is 100% if the number of active warps equals the maximum. If this factor is limiting active blocks, occupancy cannot be increased. \n",
    "\n",
    "The Streaming Multiprocessor (SM)  has a maximum number of blocks that can be active at once. If occupancy is below 100% and this factor limits active blocks, it means each block does not contain enough warps to reach 100% occupancy when the device's active block limit is reached. Occupancy can be increased by increasing block size.\n",
    "\n",
    "\n",
    "To learn more about occupancy, checkout https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Unified Memory\n",
    "\n",
    "With every new CUDA and GPU architecture release, new features are added. These new features provide more performance and ease of programming or allow developers to implement new algorithms that otherwise weren't possible to port on GPUs using CUDA.\n",
    "One such important feature that was released from CUDA 6.0 onward and finds its implementation from the Kepler GPU architecture is unified memory (UM). \n",
    "\n",
    "In simpler words, UM provides the user with a view of single memory space that's accessible by all GPUs and CPUs in the system. This is illustrated in the following diagram:\n",
    "\n",
    "<img src=\"../images/UM.png\" width=\"80%\" height=\"80%\">\n",
    "\n",
    "UM simplifies programming effort for beginners to CUDA as developers need not explicitly manage copying data to and from GPU. We will be using this feature of latest CUDA release and GPU architecture in labs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licensing \n",
    "\n",
    "Copyright © 2022 OpenACC-Standard.org.  This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
